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ABSTRACT 

Eigenvectors of data matrices play an important role in 
many computational problems, ranging from signal process- 
ing to machine learning and control. For instance, algo- 
rithms that compute positions of the nodes of a wireless net- 
work on the basis of pairwise distance measurements require 
a few leading eigenvectors of the distances matrix. While 
eigenvector calculation is a standard topic in numerical lin- 
ear algebra, it becomes challenging under severe communi- 
cation or computation constraints, or in absence of central 
scheduling. In this paper we investigate the possibility of 
computing the leading eigenvectors of a large data matrix 
through gossip algorithms. 

The proposed algorithm amounts to iteratively multiplying 
a vector by independent random sparsification of the original 
matrix and averaging the resulting normalized vectors. This 
can be viewed as a generalization of gossip algorithms for 
consensus, but the resulting dynamics is significantly more 
intricate. Our analysis is based on controlling the conver- 
gence to stationarity of the associated Kesten-Furstenberg 
Markov chain. 

Categories and Subject Descriptors 

C.2.4 [Computer-Communication Networks]: Distributed 
Systems — Distributed applications 

General Terms 

Algorithms, Performance 

1. INTRODUCTION AND OVERVIEW 

Consider a system formed by n nodes with limited compu- 
tation and communication capabilities, and connected via 
the complete graph K^- To each edge (i, j) of the graph is 
associated the entry Mij of an n x n symmetric matrix AI. 
Node i has access to the entries of Mij for j G {1, . . . , n}. 
An algorithm is required to compute the eigenvector of M 
corresponding to the eigenvalue with the largest magnitude. 



Denoting by u G M" the eigenvector, each node i has to com- 
pute the corresponding entry Ui. The eigenvector u is often 
called the principal component of M, and analysis methods 
that approximate a data matrix by its leading eigenvectors 
are referred to as principal component analysis [19| . 

Eigenvector calculation is a key step in many computational 
tasks, e.g. dimensionality reduction [29], classification [17| , 
latent semantic indexing [T], link analysis (as in PageRank) 
The primitive developed in this paper can therefore 
be useful whenever such tasks have to be performed under 
stringent communication and computation constraints. As a 
stylized application, consider the case in which the nodes are 
n wireless hand-held devices (for related commercial prod- 
ucts, see [l] [2j [3]). Accurate positioning of the nodes in 
indoor environments is difficult through standard methods 
such as GPS [26]. Because of intrinsic limitation of GPS 
and of roof scattering, indoor position uncertainty can be of 
10 meters or larger, which is too much for locating a room 
in a building. An alternative approach consists in measur- 
ing pairwise distances through delay measurements between 
the nodes and reconstructing the nodes positions from such 
measurements (obviously this is possible only up to a global 
rotation or translation). Positions indeed can be extracted 
from the matrix of square distances by computing its three 
leading eigenvectors (after appropriate centering) [24]. This 
method is known as multidimensional scaling, and we will 
use it as a running example throughout this paper. 

A simple centralized method for computing the eigenvec- 
tor is to collect all the matrix entries at one special node, 
say node i, to perform the eigenvector calculation there and 
then flood back its entries to each node. This centralized 
approach has several disadvantages. It requires communi- 
cating real numbers through the network at the begin- 
ning of execution, and puts a large memory, computation 
and communication burden on node i. It is also very fragile 
to failure or Byzantine behavior of i. 

The next simplest idea is to use some version of the power 
method. A decentralized power method would proceed by 
synchronized iterations through the network. At t-ih iter- 
ation, each node keeps a running estimate a;^'' of the leading 
eigenvector. This is updated by letting a^i*"*^^^ = Mijx^* 
If M has strong spectral features (in particular, if the two 
largest eigenvalues are not close) these estimates will con- 
verge rapidly. On the other hand, each iteration requires 
(n — 1) real numbers to be transmitted to each node, and 



n sums and multiplications to be performed at the node. 
In other words, the node capabilities have to scale with the 
network size. This problem becomes even more severe for 
wireless devices, which are intrinsically interference-limited. 
Within the power method approach, communications 
have to be scheduled at each time thus requiring significant 
bandwidth. Finally, the algorithm requires complete syn- 
chronization of the communications and is fragile to link 
failures (which can be quite frequent e.g. due to fading). 

A simple and yet powerful idea that overcomes some of these 
problems is sparsification. Throughout the paper, we say 
that S £ K"*^" is a sparsification of M if it is obtained 
by setting to some of the entries of M and (eventually) 
rescaling the non-zero entries. A sparsification is useful if 
most of its entries are zero, and yet the resulting matrix 
has a leading eigenvector close to the original one. Given 
a sparsified matrix S, power method can be applied by 
_ SijX^p . If S has d-nonzero entries per row, 

each node needs to communicate d real numbers, and to per- 
form d sums and multiplications. For wireless devices, the 
badwidth scales at most like nd. 

In [4] Achlioptas and McSherry showed that a sparsification 
can be constructed such that 

\\M-S\\2<e\\M\\2, (1) 

with only d = 0{l/d^) non-zero entries per row. The in- 
equality ([1]) immediately implies that computing the lead- 
ing eigenvector of S, yields an estimator u that satisfies 
11^— m|| < 29. (Here and below, for v,w £ K™, v* denotes 
its transpose and {v,w) — v*w denotes the scalar product 
of two vectors. Let ||w|| = {v,v} denote its Euclidean -or £2- 
norm, i.e. \\v\\^ = X^ILi^?- ^'-'^ ^ matrix A, \\A\\2 denotes 
its £2 operator norm, i.e. ||j4||2 = sup„^Q || .) The 
construction of ^ is based on random sampling. Each en- 
try of M is set to independently with a given probability 
1 — p = 1 — d/n. Non-zero entries are then rescaled by a 
factor 1/p. The bound |T| is proved to hold with high prob- 
ability with respect to the randomness in the sparsification. 

While this approach is simple and effective, it still presents 
important shortcomings: (i) For a fixed per node complex- 
ity which scales like 1/0^, this procedure achieves precision 
6: can one achieve a better scaling? (it) A fixed subnetwork 
G of the complete graph (corresponding to the sparsity pat- 
tern of S) needs to be maintained through the whole process. 
This can be challenging in the presence of fading or of node 
failures/departures, [in) The target precision is to be de- 
cided at the beginning of the process, when the sparsification 
is constructed. 

In this paper we use sparsification as a primitive and pro- 
pose a new way to exploit its advantages. Roughly speaking 
at each round t a new independent sparsification S*-'' of M 
is produced. Estimates of the leading eigenvector are gener- 
ated by applying 5'*', i.e. through 

^ , (2) 

and then averaging across iterations m'*^ oc X]f<t a^^^Vll^^'^' II • 
We will refer to this algorithm as GossiP PGA. In the limit 
case m which S'(') are in fact deterministic and coincide with 
a fixed S, the present scheme reduces to the previous one. 



However, general independent random sparsifications S*-*^ 
can model the effect of fading, short term link failures, node 
departures. (While complete independence is a simplistic 
model for these effects, it should be possible to include short 
time-scale correlations in our treatment.) Finally, the use of 
truly random, independent sparsifications might be a choice 
of the algorithm designer. 

Does the time-variability of S'*' deteriorate the algorithm 
precision? Surprisingly, the opposite turns out to be true: 
Using independent sparsifications appears to benefit accu- 
racy by effectively averaging over a larger sample of the 
entries of M. As an example consider the sparsification 
scheme mentioned above, namely each entry of 5'*' is set 
to independently with a fixed probability 1 — p. Then, 
with respect to the total per-node computation and com- 
munication budget, scaling of the £2 error ||u — u\\ remains 
roughly the same as in the time-independent case (see Sec- 
tion |3|. Remarkably, the way optimal accuracy is achieved 
is significantly different from the one that is optimal within 
the time-independent case. In the latter case it is optimal 
to invest resources in the densest possible sparsification S, 
and then iterate it a few times. Within the present approach, 
one should rather use much sparser matrices S'*-* and iterate 
the basic update Q many more times. The use of sparser 
subnetworks is advantageous both for robustness and the 
overhead of maintaining/synchronizing such networks. 

Our main analytical result is an error bound for the time- 
dependent iteration ([2|, that takes the form 

-u\\<c {e/Vt + \og{i/ef) , (3) 

with a constant C explicitly given below. Notice that, for t 
large enough, this yields an error roughly of size 9^ . While 
using the same number of communications per node, this is 
significantly smaller than the error 9 obtained by computing 
the leading eigenvector of a single sparsification. 

The upper bound Q holds under the following three as- 
sumptions: (i) \\M - 5''*'||2 < 6*11^112 for aU £ < t; {ii) 
E(S('^') = M; [Hi) S'-''' invertible for all £. Further it is re- 
quired that the initial condition satisfies Ha;'"' — u\\ < C9. 
This can be generated by iterating a fixed sparsification (say 
S''^') for a modest number of iterations (roughly log(l/6')). 
Numerical simulations and heuristic arguments further sug- 
gest that the last assumption is actually a proof artifact and 
not needed in practice (see further discussion in Section [272] ). 

The rest of the paper is organized as follows: Section [2] pro- 
vides a formal description of our algorithms and of our gen- 
eral performance guarantees. In Section |3] we discuss impli- 
cations of our analysis in specific settings. Section [4] reviews 
related work on randomized low complexity methods. Sec- 
tion [S] describes the proof of our main theorem. This lever- 
ages on the theory of products of random matrices, a line 
of research initiated by Furstenberg and Kesten in the six- 
ties [15], with remarkable applications in dynamical systems 
theory [25]. The classical theory focuses however on matri- 
ces of fixed dimension, in the limit of an infinite number of 
iterations, while here we are interested in high-dimensional 
(large n) applications. We need therefore to characterize 
the tradeoff between dimensions and number of iterations. 



In Section [6] we provide the proof of the technical lemmas 
used in the main proof. Finally, Section [7] discusses extend- 
ing our algorithm to estimate the largest eigenvalues and 
provides a general performance guarantee. 

2. MAIN RESULTS 

In this section, we spell out the algorithm execution and 
state the main performance guarantee. 

2.1 Algorithm 

As mentioned in the previous section M £ R"^" is a sym- 
metric matrix, with eigenvalues Ai, A2, . . . , A„. Without loss 
of generality, we assume that the largest eigenvalue Ai is pos- 
itive. Further, we assume Ai > |A2j strictly. We will also 
write A = Ai and u for the corresponding eigenvector. We 
assume to have at our disposal a primitive that outputs a 
random sparsification S of M . A sequence of independent 
such sparsifications will be denoted by {S''^', S'^', . . . }. In 
the next two paragraphs we describe a centralized version of 
the algorithm, and then the fully decentralized one. 

2.1.1 Centralized algorithm 

The system is initialized to a vector a;^"-* G R". Then we 
iteratively multiply the i.i.d. sparsifications S^^\ S^'^\ ... to 
get a sequence of vectors x'^^\ x^'^\ .... After t iterations, 
our estimate for the leading eigenvector u is 
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with c{t) the appropriate normalization to ensure ||S'') || = 1. 

Note that, even after normalization, there is a residual sign 
ambiguity: both u and ~u are eigenvector. When in the 
following we write that approximate u within a certain 
accuracy, it is understood that u'*) does in fact approximate 
the closest of u and —u. A more formal resolution of this 
ambiguity uses the projective manifold define in Section [S] 

2.1.2 Decentralized algorithm 

The algorithm described so far uses the following; operations: 
(i) Multiplying vector x^^'^^ by S^^\ cf. Eq. If 5"'*' has 
dn non-zero elements, this requires 0{d) operations per node 
per round. 

(a) Computing the normalizatio- 

Since ||a;''^-'||^ = X]r=i(^i^')^' ^^^^ ^^^'^ ^^'^ ^e performed via 
a standard gossip algorithm. This entails an overhead of 
log(l/e) per node per iteration for a target precision e. We 
will neglect this contribution in what follows. 
(in) Averaging normalized vectors across iterations, cf. Eq. Q. 
Since node i keeps the sequence of estimates x^',. . . , a;'*', 
this can be done without communication overhead, with 
0(1) computation per node per iteration. 

Finally the normalization constant c{t) in Eq. Q needs to 
be computed. This amounts to computing the norm of the 
vector on the right hand side of Eq. Q, which is the same 
operation as in step (2) (but has to be carried out only once). 
From this description, it is clear that operation (1) (matrix- 
vector multiplication) dominates the complexity and we will 
focus on this in our discussion below and in Section [S] 



2.2 Analysis 

The algorithm design/optimization amounts to the choice 
of number of iterations t and the the sparsification method, 
which produces the i.i.d. matrices {S'^'}. The latter is char- 
acterized by two parameters: 9 which bounds the sparsifica- 
tion accuracy as per Eq. ([T]), and d, the average number of 
non-zero entries per row, which determines its complexity. 

The trade-off between d and 9 depends on the sparsification 
method and will be further discussed in the next section. 
Our main result bounds the error of the algorithm in terms 
of 9, t and of a characteristic of the matrix M , namely the 
ratio of the two largest eigenvalues I2 = IA2I/A. The proof 
of this theorem is presented in Section [S] 

Theorem 2.1. Let{S''^^i>i be a sequence of i.i.d. nxn 
random matrices such that £[5"'^'] = M, ||S'*'^' - M\\2 < 
^||Af||2, S^^' is almost surely non-smgular, and there is no 
proper subspace V C R" such that S^^^V C V almost surely. 
Further, let x*^"^ € R" be such that \\x'^°'> - u\\ < 9/(1 - h) 
for the leading eigenvector of u. Let the eigenvector esti- 
mates be defined as per Eq. ^ and Finally assume 
9 < (1/40)(1 - hf^'^ and let h = IA2I/A. 

Then, with probability larger than 1 — max(<5, 16/n^), 

189 /fllr,0-n/flU2 
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The assumption on the samples {S^^^}e>o are rather mild. 
The matrix whose eigenvector we are computing is the ex- 
pectation of S'*-* , the variability of S^-^^ is bounded in oper- 
ator norm, and finally the 5'^' are sufficiently random (in 
particular the do not share an eigenvector exactly). The 
latter can be ensured by adding arbitrarily small random 
perturbation to S*-^' . 

At first sight, the assumption ||a;*-'^' — it|| < 9/(1 — I2) on 
the initial condition might appear unrealistic: the algorithm 
requires as input an approximation of the eigenvector u. A 
few remarks are in order. First, the accuracy of the output, 
see Eq. (IB]), is dramatically higher than on the input for 
t — 0,(1/9 ). In the following section, we will see that this 
is indeed the correct scaling of t that achieves optimal per- 
ormance. Second, numerical simulations show clearly that, 
for x'*) = 2;(')/||a;(')||, the condition ||x(*) - m|| < 0/(1 - /a) 
is indeed satisfied after a few iterations. The heuristic argu- 
ment is that the leading eigenvectors of S''^\ 5*^', . . . S**' 
are roughly aligned with u, and their second eigenvalues are 
significantly smaller. Hence the scalar product Zt = (u, x*^*^) 
behaves approximately as a random walk with drift pushing 
out of Zt = 0. Even if Zo = 0, random fiuctuations produce 
a non- vanishing Zt, and the drift amplify this fluctuation 
exponentially fast. The arguments in Section [5] further con- 
firm this heuristic argument. For instance we will prove that 
the set llx'*' —'"11 < 9/(1 — 12) is absorbing, in the sense that 
starting from such a set, the power iteration keeps x'*' in the 
same set. On the other hand, starting from any other point, 
there is positive probability of reaching the absorbing set. 
Finally, further evidence is provided by the fact that ran- 
dom initialization is sufficient for the eigenvalue estimation 
as proved in Section [7] 



As an example, we randomly generated a marix M and com- 
puted x''-* = a:'''^/||a;'''^ || according to ([2| using random spar- 



sifications with dn entries. Let r = argmint{||Xj 



(t) 



o " Lii rand ^ ii — _ {t — 1) J 

0.001}. The subscript denotes two different initializations: tmg x = bx and 



In order to compute a computation-accuracy tradeoff we 
need to link the accuracy to t and 9. Let us first consider 
the case in which a single sparsification S is used by let- 
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This procedure 



is initialized with i.i.d Gaussian entries, and Xu 



(0) 



The following result illustrates that after a few iterations t — 
0(Iog(l/e')), x'*' achieves error of order 6 with d = 0{l/e^) 
operations per node per round. 
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Finally, constructing a rough approximation of the leading 
eigenvector is in fact an easy task by multiplying the same 
sparsification S^"^ a few times. This claim is made precise 
by the following elementary remark. 

Remark 2.2. Assume that x'"' have i.i.d. components 
N(0, 1/n), and define a;'*' = ^''^a;'*"^' where for t < t^, 
S''*' = S ts time independent and satisfies \\S — M\\2 < 
(6>V2(1 - Z2))||M||2. Ift, > 31og(n/0)/(l h - 6), then 
llx'**-* — u|| < 9/(1 — I2) with probability at least 1 — 1/n^. 

The content of this remark is fairly intuitive: the principal 
eigenvector of S is close to u, and the component of x'*' along 
it grows exponentially faster than the other components. A 
logarithmic number of iterations is then sufficient to achieve 
the desired distance from u. 

Finally, consider the assumption EfS''*-'] = M. In practice, it 
might be difficult to produce unbiased sparsifications: does 
Theorem |2.1| provide any guarantee in this case? The answer 
is clearly affirmative. Let E[S^''] = M' and assume \\AI — 
M'\\2 < 9'\\M\\2. Then, it follows immediately from ^ that 
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In other words the eigenvector approximation degrades grace- 
fully with the quality of the sparsification. 

3. EXAMPLES AND APPLICATIONS 

In this section we apply our main theorem to specific settings 
and point out possible extensions. 

3.1 Computation-accuracy tradeoff 

As mentioned above, Theorem |2. 1| characterizes the scaling 
of accuracy with the quality of the sparsification procedure. 
For the sake of simplicity, we will consider the case in each 
entry of M is set to independently with a fixed probability 
1 — d/n, and non-zero entries are rescaled. In other words 
Sij = {n/d)Mij with probability (d/n), and Sij = other- 
wise. This scheme was first analyzed in [4], but the estimate 
only holds for d > (81ogn)^. This condition was refined in 
[22| . Noting that for d > logn the maximum number of 
entries per row is of order d, the latter gives 

||M - SII2 < {C/Vd)\\M\\2 = 9\\M\\2 . 

In other words i.i.d. sparsification of the entries yields 9 — 
0{1/Vd). Further, denoting the total complexity per node 
by Xi w6 have x id either in terms of communication or 
of computation. 



converges exponentially fast to the leading eigenvector of S 
which in turn satisfies -u|| < 29 < C'{l/Vd). There- 

fore if we denote by the corresponding 

error after t iterations, we have 



' + e- 



where we deliberately omit constants since we are only in- 
terested in capturing the scaling behavior. 

Now we assume that we have a limit on the total com- 
plexity X ~ id,, and minimize the error ApM under this 
resources constraint, using the relation 9 ~ 1/^/d. A simple 
calculation shows that the smallest error is achieved when 
t = e(logx) yielding 



ApM ~ \/(logx)/x 



(6) 



Next consider the algorithm developed in the present pa- 
per. Gossip PCA. The only element to be changed in our 
analysis is the relation between accuracy and the parame- 
ters 9 and t. From Theorem |2JJ we know that our estimator 



achieves error Ago 



u\\ that scales as 



^Gossip 



9/Vt + {9\og{l/9)y 



where again we omit constants. It is straightforward to min- 
imize this expression under the constraints x ~ td, and 9 ~ 
1 / -\/d. The best scaling is achieved when t = 0(-^/x/ (log x)^) 
and 9 = e(l/(x^'"' logx)) yielding 



AgossIp ~ 1/ ■ 



(7) 



Comparing ([6| and ([7|, the scaling of the error with the 
per-node computation and communication remains roughly 
the same up to a logarithmic factor. Surprisingly, the way 
the best accuracy is achieved is significantly different. In 
the time-independent case (the standard power method), it 
is optimal to invest a lot of resources in one iteration with 
a dense matrix S that has d = 6(x/logx) non-zero entries 
per row. In return, only a few iterations t — 0(logx) are 
required. Within the proposed time-dependent gossip ap- 
proach, one should rather use a much sparser matrices 
with d — 0(y^(logx)^) non-zero entries per row and use a 
larger number of iterations t — 6(-^/x/(log x)^)- 

To illustrate how the two gossip algorithms compare in prac- 
tice, we present results of a numerical experiment from the 
positioning application. From 1000 nodes placed in the 2- 
dimensional unit square uniformly at random, we define the 
matrix of squared distances. Let pi be the position of node i, 
then Dij = \\pi — Pj\\^ ■ After a simple centering operation, 
the top two eigenvectors reveal the position of the nodes 
up to a rigid motion (translation and/or rotation) 24 . We 
can extend the gossip algorithms to estimate the first two 
eigenvectors as explained in Section |3.3[ Let the columns 
of f/ G R^°' '^ 



be the first two eigenvectors and || ■ 

1^11^ = E. 



\f be 
j ■ 



the Frobenius norm of a matrix such that 

Denote by A{d) = {1/\^)\\U — U\\f the resulting error for 
a particular choice of d. 
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Figure 1: Eigenvector estimation error against com- 
plexity. In the inset tlie result is plotted in log-scale. 

To simulate a simple gossip setting with constrained com- 
munication, we allow d to be either 50 or 500. For the two 
gossip algorithms and for each value of the total complexity 
X, we plot the minimum error achieved using one of the two 
allowed communication schemes: minrfg^so^soo} ^(d)- For 
comparison, performance of the power method on complete 
dense matrices is also shown (see Section [TJ . As expected 
from the analysis, GossiP PCA achieved smaller error with 
sparse matrices (d — 50) for all values of x- When a single 
sparsification is used, there is a threshhold at x = 14500, 
above which a dense matrix (d = 500) achieved smaller er- 
ror. Notice a discontinuity of the derivative at the threshold. 

3.2 Comparison with gossip averaging 

Gossip methods have been quite successful in computing 
symmetric functions of data {a;^'^''}i<i<„ available at the 
nodes. The basic primitive in this setting is a procedure 
computing the average X^ILi •'^i'^V'^- This algorithm shares 
similarities with the present one. One recursively applies 
independent random matrices P^^\ P^^\ . . . according to: 

x« = P«:e('-i) , (8) 

where P*-'-* is the matrix that averages entries i{t) and j(t) 
of a;'-'-' (in other words it is the identity outside a 2 x 2 block 
corresponding to coordinates i{t) and j{t)). 

It is instructive to compare the two problems. In the case 
of simple averaging, one is interested in approximating the 
action of a projector P, namely the matrix with all entries 
equal to 1/n. In eigenvector calculations the situation is 
not as simple, because the matrix of interest M is not a 
simple projector. In both cases we approximate this action 
by products of i.i.d. random matrices whose expectation 
matches the matrix of interest. However in averaging, the 
leading eigenvector of P is known a priori, it is the constant 
vector u — {1/^/n, . . . , l/^/n). As a consequence, sparsifica- 
tions P*-*' can be constructed in such a way that P(*' u = u 
with probability 1. 

Reflecting these differences, the behavior of the present algo- 
rithm is qualitatively different from gossip averaging. Within 
the latter a;'*' converges asymptotically to the constant vec- 
tor, whose entries are equal to X]r=i ^i^V*^- The conver- 
gence rate depends on the distribution of the sparsification 
P'-*'. In Gossip PCA, the sequence of normalized vectors 
a;'-*'/!!^^'*^ II does not converges to a fixed point. The distri- 
bution of a;'-*'/l|2^^''ll instead converges to a non-trivial sta- 



tionary distribution whose mean is approximated by '. 
An important step in the proof of Theorem |2.1| consists in 
showing that the mean of this distribution is much closer to 
the eigenvector than a typical vector drawn from it. 

3.3 Extensions 

It is worth pointing out some extensions of our results, and 
interesting research directions: 

More than one eigenvector. In many applications of inter- 
est, we need to compute r leading eigenvectors, where r is 
larger than one, but typically a small number. In the case 
of positioning wireless devices, r is consistent with the am- 
bient dimensions, hence r = 3. As for the standard power 
iteration, the algorithm proposed here can be generalized to 
this problem. At iteration t, the algorithm keeps track of 
r orthonormal vectors x^^\l), . . . x^*\r). In the distributed 
version, node i stores the i-th coordinate of each vector, thus 
requiring 0(r) storage capability. The vectors are updated 
by letting 5;'*' (a) = S''*^ x^'^\a). and then orthonormalizing 
£'*'(!), ...S'*'(r) to get x<*)(l), ...a;'*)(r). Orthonormal- 
ization can be done locally at each node if it has access to 
the Gram matrix G = {Gab)-i<a,b<r 

1 " 

Gat^^J2^f\a)x^P{b) (9) 
1=1 

This can be computed via gossip averaging, using messages 
consisting of r(r -|- l)/2 real numbers. Therefore the total 
communication complexity per node per iteration is of order 
r'^log(l/£) to achieve precision e. Indeed, such distributed 
orthonormalization procedure was studied in ^l] for decen- 
tralized implementation of the standard power method. 

Richer stochastic models for random sparsification. Our 
main result holds under the assumption that S'^' , S^^\. . . ,5'*^ 
are i.i.d. sparsifications of the matrix M. This is a reason- 
able assumption when the random sparsifications are gen- 
erated by the algorithm itself. The same assumption can 
also model short time-scale link failures, as due for instance 
to fast fading in a wireless setting. On the other hand, a 
more accurate model of link failures would describe S*-^', 
5''^-',. . . ,5''*-' as a stochastic process. We think that our main 
result is generalizable to this setting under appropriate er- 
godicity assumptions on this process. More explicitly, as 
long as the underlying stochastic process mixes (i.e. loses 
memory of its initial state) on time scales shorter than t, 
the qualitative features of Theorem |2.1| should remain un- 
changed. Partial support of this intuition is provided by the 
celebrated Oseledets' multiplicative ergodic theorem that 
guarantees convergence the exponential growth rate of Hx''' || 
in a very general setting [25] (namely within the context of 
ergodic dynamical systems) . 

Communication constraints: Rate and noise. In a decen- 
tralized setting, it is unavoidable to take into consideration 
communication rate constraints and communication errors. 
The presence of errors implies that the actual matrix used at 
iteration t is not S*'*' but is rather a perturbation of it. The 
effect of noise can then be studied through Theorem |2.1[ 
Rate constraints imply that real numbers cannot be com- 
municated through the network, unless some quantization 
is used. An approach consists in using some form of ran- 



domized rounding for quantization. In this case, the effect 
of quantization can also be studied through Theorem |2.1[ 
This implies that, roughly speaking, the error in the eigen- 
vector computed with this approach scales quadratically in 
the quantization step. (Notice that quantization also affects 
the vector on the right-hand side of Eq. ([2|, but we expect 
this effect to be roughly of the same order as the effect of the 
quantization of S*'*'.) Further, when the matrix M itself is 
sparse, or a fixed sparsification S is used within the ordinary 
power method, Theorem |2.1| can be used to study the effect 
of noise and quantization. 

4. RELATED WORK 

The need for spectral analysis of massive data sets has moti- 
vated a considerable effort towards the development of ran- 
domized low complexity methods. A short sample of the 
theoretical literature in this topic includes [Oj [4] [lOj [T4j 
[T2| [TI] . Two basic ideas are developed in this line of re- 
search: sparsify of the original matrix M to reduce the cost 
of matrix-vector multiplication; apply the matrix AI to a 
random set of vectors in order to approximate its range. 
Both of these approaches are developed in a centralized set- 
ting where a single dataset is sent to a central processor. 
While this allows for more advanced algorithms than power 
iteration, these algorithms might not be directly applicable 
in a decentralized setting considered in this paper, where 
each node has limited computation and comunication capa- 
bility and the datasets are often extremely large such that 
the data has to be stored in a distributed manner. 

Fast routines for low-rank approximation are useful in many 
areas of optimization, scientific computing and simulations. 
Hence similar ideas were developed in that literature: we 
refer to [Tg] for references and an overview of the topic. 

Kempe and McSherry [2l] studied a decentralized power it- 
eration algorithm for spectral analysis. They considered ma- 
trices that are inherently sparse. Therefore, no sparsification 
is used and all the entries are exploited at every iteration. 
Hence, their algorithm eventually computes the optimal low- 
rank approximation exactly. The same paper introduced the 
decentralized orthonormalization mentioned in Section [3.31 



The idea of using a sequence of distinct sparsifications to 
improve the accuracy of power iteration was not studied in 
this context. Somewhat related is the basic idea in ran- 
domized algorithms for gossip averaging [s]. As discussed 
in Section |3.2[ these algorithms operate by applying a se- 
quence of i.i.d. random matrices to an initial vector of data. 
The behavior and analysis is however considerably simpli- 
fied by the fact that these matrices share a common leading 
eigenvector, that is known a priori, namely the eigenvector 
u = (l/-^/n, . . . , f/y'n). Overviews of this literature is pro- 
vided by [27] and [S] . Quantization is an important concern 
in the practical implementation of gossip algorithms, and 
has been studied in particular in the context of consensus 
[20 13 . As discussed in the last section, the effect of ran- 
domized quantization can also be included in the present 
setting. 



ing to study the effect of such sparsification methods in the 
present setting. 

5. PROOF OF THE MAIN THEOREM 

In this section, we analyze the quality of the estimation u*-*' 
provided by our algorithm and prove Theorem |2.1[ Before 
diving into the technical argument, it is worth motivating 
the main ideas. We are interested in analyzing the random 
trajectory {x^^^}t>o defined as per Eq. One difficulty is 
that this process cannot be asymptotically stationary, since 
a;'*' gets multiplied by a random quantity. Hence it will 
either grow exponentially fast or shrink exponentially fast. 

A natural solution to this problem would be to track the 



normalized vectors a;'-'-' 



,W/||2.W| 



Also this approach 



presents some technical difficulty that can be grasped by 
considering the special case in which 5'*) = M for all t (no 
sparsification is used). Neglecting exceptional initial condi- 
tions (such that (a;^"' , u) = 0) this sequence can either con- 
verge to u or to —It. In particular, it cannot be uniformly 
convergent. The right way to eliminate this ambiguity is to 
track the unit vectors a;'*' 'modulo overall sign'. The space 
of unit vectors modulo a sign is the projective space Pn, that 
we will introduce more formally below. 

We are therefore naturally led to consider the random tra- 
jectory {xi}t>o -indeed a Markov chain- taking values in 
the projective space Xt £ P„. We will prove that two im- 
portant facts hold under the assumptions of Theorem |2.1[ 
(1) The chain converges quickly to a stationary distribution 
fi; (2) The distance between the baricenter of u and u is of 
order 9'^. Fact (1) implies that cf. Eq. H, is a good 
approximation of the baricenter of fi. Fact (2) then implies 
Theorem O 



In the next subsection we will first define formally the pro 
cess {xt}t>o, and provide some background (Section 5.11 



and then present the formal proof (Section 5.21, along the 
lines sketched above. 

5.1 The Kesten-Furstenberg Markov chain 

As anticipated above, we shall denote by P„ the projective 
space in R". This is defined as the space of lines through the 
origin in R". Equivalently, P„ is the space of equivalence 
classes in R" \ {0} for the equivalence relation ~p, such 
that X ~p y if and only if x = Xy for some A G R \ {0}. 
This corresponds with the description given above, since it 
coincides with the space of equivalence classes in 5" = {a: £ 
R" : ||a;|| = 1} for the equivalence relation ~p, such that 
X ~p y if and only if x = \y for some A G { + 1, — 1}- 

In the future, we denote elements of P„ by boldface letters 
X, y, z, . . . and the corresponding representatives in R" by 
x,y, z, . . . . We generally take these representatives to have 
unit norm. We use a metric on this space defined as 



rf(x,y) = \/l - {x,yy . 
Random elements in P„ will be denoted by boldface capitals 



Finally, there has been recent progress in the development 
of sparsification schemes that imply better error guarantees 
than in Eq. ([T|, see for instance [28]. It would be interest- 



An invertible matrix 5* G R"^" acts naturally on P„, by 
mapping x G Pn (with representative x) to the element y G 



P„ with representative of ^a; (namely the line through 5*3;, 
or the unit vector 5a;/||Sa;|| modulo sign). We will denote 
this action by writing y = Sx, but emphasize that it is a 
non-linear map, since it implicitly involves normalization. 

Given a sequence of i.i.d. random matrices {S''-*'}t>i that 
are almost surely invertible, with common distribution ps, 
we define the Markov chain {Xt}t>o with values in P„ by 
letting 

Xj = 5W5'(*-i)...5{i)Xo, (10) 

for all t > 1. We assume the following conditions: 

LI. There exists no proper linear subspace V C R" such 
that S'^^^V C V almost surely. 

L2. There exist a sequence {S^^^}t>i in the support of ps, 
such that letting = S'^'S^^^^' • ■ • 5'(i\ we have 
ff2(S^)/cri(5'^) ^ as T ^ oo. 

It was proved in [23], that, under the assumptions LI and 
L2, there exists a unique measure fi on P„ that is stationary 
for the Markov chain {Xt}. The Markov chain converges to 
the stationary measure as t <x (we refer to the Appendix 
for a formal statement). 



For the purpose of proving Theorem |2.1[ uniqueness of the 
stationary measure is not enough: we will need to control 
the rate of convergence to stationarity. We present here a 
general theorem to bound the rate of convergence, and we 
will apply it to the chain of interest in the next section. Let 
us start by stating two more assumptions. We denote by 
G C P„ a (measurable) subset of the projective space, and 
assume that there exists a constant p £ (0, 1) such that 

Al. For any x G G, ^'''x G G almost surely. 

A2. For any X / y G G, E [d(S'(''x, ^'''y)] < pd{x,y). 

We then have the following. 

Theorem 5.1. Assume conditions LI and L2 hold, to- 
gether with Al and A2. Denote by p the unique stationary 
measure of the Markov chain {Xt}t>o. Then 

m(g^) = o. 

Further, if Xq G G then for any L-Lipschitz functioi^ f : 
Pn K, we have 

jE[/(XO]-/i(/)| < Lp'. 



The proof of this Theorem uses a coupling technique anal- 
ogous to the one of [23]. We present it in the appendix for 
greater convenience of the reader. 

5.2 Proof of Theorem [23] 

In this section we analyze the GossiP PGA algorithm us- 
ing the general methodology deve lope d above. In particu- 
lar, we consider the Markov chain \lo\ whereby {S^^^}i<e<t 
are i.i.d. sparsifications of M satisfying the conditions: (i) 

ll^t*) - A'/||2 < e||M||2; (ii) E[S'-'^'>] = M; (m) S'-''' is almost 

""^We say that / is L-Lipschitz if, for any x,y G Pn, l/(x) — 
/(y)| <Ld(x,y). 



surely non-singular. Throughout the proof, we let u G Pn 
denote an element of Pn represented by u. 

Note that the conditions LI stated in the previous section 
holds by assumption in Theorem 2.1 Further let Ai(^) and 



X2{£) the largest and second largest singular values of S^^K 
By assumption (i), and since by hypothesis 9 < (1/40) (1 — 
/a)'/', implying ||S(''-A.f||2 < (A-|A2|)/2, we have |Ai(^)/A2(^)| 
> 1 almost surely. Hence by taking S^^^ = S^^' = • • • = 
5''^-' — . . . in the support of ps, we have that condition L2 
holds as well. 

By applying the main theorem in [23] (restated in the Ap- 
pendix), we conclude that there exists a unique stationary 
distribution p for the Markov chain {Xt}, and that the chain 
converges to it. 



We next want to apply Theorem |5.1| to bound the support 
and the rate of convergence to this stationary distribution. 
We define the 'good' subset G C Pn by 



x G Pn : d(x, u) < 



26 
1-/2 



(11) 



Our next lemma shows assumptions Al and A2 are satisfied 
in this set G, with a very explicit expression for the contrac- 
tion coefficient p. 



Lemma 5.2. Under the hypothesis of Theorem \2.1\ for 
any x G G «;e have ^'^'x G G. Further, for any x 7^ y G G, 
letting p=l — (4/5)(l — k) G (0, 1), we have 

Ed(5'<'>x,5'<*V) < pd(x,y) . 

The proof of this lemma can be found in the next section. 
As a consequence of this lemma we can apply Theorem |5.1| 
In particular, we conclude that p is supported on the good 
set G. 

Next consider the estimate u''-* G R" produced by our al- 
gorithm, cf. Eq. Q. This is given in terms of the Markov 
chain on Pn by 

IIELi/(x.)ll ' 

where we define / : Pn i-*- R" such that /(x) is a represen- 
tative of X satisfying ||/(x)|| = 1 and (u, /(x)) > 0. We use 
Ut G Pn to denote an element in Pn represented by G'*' . 

Let p{f) = //(x)/i(dx) G R" be the expectation of /( • ) 
with respect to the stationary distribution (informally, this 
is the baricenter of p). With a slight abuse of notation, we 
let p{f) denote the corresponding element in Pn as well. 
Then, by the triangular inequality, we have, for any t, 

d(u,\J,)<d{u,p{f)) + d{Ut,pif)) . 

The left hand side is the error of our estimate of the leading 
eigenvector. This is decomposed in two contributions: a 
deterministic one, namely d{u, p{f)), that gives the distance 
between the leading eigenvector and the baricenter of p, and 
a random one i.e. d(Ut,^(/)), that measures the distance 
between the average of our sample and the average of the 
distribution. 



In order to bound d(Uf, we use the following fact that 

holds for any a, fe G R" 



{a,by ^ \\a-b\\ 
|aP||6||2 



(12) 



This follows immediately from 2||a||||6|| —2{a,b) < ||a — 
We apply this inequality to a = fi{ f ) and b = (l/t) fO^e)- 
We need therefore to lower bound and || (l/t) X]f=i fO^^) 

and to upper bound \\{l/t) J^Li fO^t) - 

Deno te by Vu the orthogonal projector onto u. From The- 
we know that fJ.{C) = 0. Hence, using 9 < 



5.1 



(l/40)(l-^2)'/^wehave||M(/)|l > > \/l - 1/400. 

Similarly since Xq G G, we have by Al that G G for all 
£, and therefore \\ {l/t) fC^e)]] > - 1/400. We are 

left with the task of bounding ||a — This is done in the 
next lemma that uses in a crucial way Theorem [5T1 

Lemma 5.3. Under the hypothesis of Theorem \2.1\ 



E||^^/(X,)-M(/)|f < 7^ 



706(2 



1=1 



il-l2)H 



Applying Markov's inequality and Eq. (121, we get, with 
probability larger than 1 — 5/2 

d{Vt,^^{f))< 



(i-hWts 



Next, we bound the term d(u, /i(/)) in Eq. (12 1 with the 
following lemma. 



Lemma 5.4. Under the hypothesis of Theorem \2.1\ 

By noting that \\u — < -v/2d(u, Ut), this finishes the 

proof of the theorem. 

6. PROOF OF TECHNICAL LEMMAS 
6.1 Proof of Remark |221 

Assuming initial vector X G R" with i.i.d. Gaussian en- 
tries, we can get close to u by iteratively applying a single 
sparsification S. Define a good set of initial vectors 



, * , . 1 , I , I / 6 log n 

u x\ > ^ ,„ and max m, a; < \ 

~ 7i=/2 ie[ni V n 



Since, u*X's are independent and distributed as N(0, 1/n), it 
follows that we have P(|m*X| > i/(61og n)/n) < and 
P(|u*X|) < 1/n^. Applying union bound, we get P(X G 
J'n) > 1 — S/ri^. Assuming we start from this good set, we 
show that for k large enough, we are guaranteed to have 

||w-x''='|| <e/{i-i2). 

Let {Xi} be the eigenvalues of S such that Ai > jA2| > • ■ • > 
|An,|, and let {ui} be the corresponding eigenvectors. We 



know that Ai > since Ai > A-||5'-M||2 and ||S-M||2 < A 
by assumption. Then, by the triangular inequality. 



-x^'^'ll < \\u- 



To bound the first term, note that 
||M-S'||2 > \u{M~S)u\ 

> X-\i{uuf-X2\\rz±iu)f. 

This implies that {u*uf > (A - A2 - ||M - S'||2)/(Ai -J2). 
We can further apply Weyl's inequality [18], to get |Ai — 
A,j < \\M - S'||2. It follows that {u'uf > (A - A2 - 2\\M - 
5'||2)/(A - A2 + 2\\M - Sy). Note that this bound is non- 
trivial only if \\M - S||2 < (A - A2)/2. Using the fact that 
(1 — i)/(l + a) < (1 — a)^ for any \a\ < 1, this implies that 



\\u — < 



4||M-5||2 



A- A2 



In particular, for ||M-S'||2 < 6'^||M||2/(2(1 - ;2)) as per our 
assumption, this is less than ^/29/{l — h)- 



To bound the second term, we use x*^"' G J-n to get 



\\Skx(0)\\2 



> 



1 + E,:; 



(0))2 



i>2 X^k (5.^(0) )2 

> 1 - (A2/Ai)^'=6n'^logn 

- 4(1-/2)2- 

In the last inequality we used k > 31og(n/&)/(l — h — 0), 
and the fact that (A2/A1) < h + 0. Then, \\u - x(*'|| < 
S/(\/2(l — h))- Collecting both terms, this proves the de- 
sired claim. □ 



6.2 Proof of Lemma 1121 

Lemma 6.1 (Contraction). For a given v < (1/20), 
assume that X , x' satisfy \\x\\ — \\x'\\ = 1, {u,x} > 0, {u,x'} > 
0, ||'Pu^(2;)|| < V, and \\'Puj^{x')\\ < v. Then, under the hy- 
pothesis of Lemma \5.2\ we have 

Qx Qx 



< {l2{l + Zv')+W)\\z- z'W , 

(13) 



and 



Vu 



Qx' 



< (4iy + m\\z- z' 



\Qx\\ WQx' 

where I2 = \X2\/X, z = Vu^ {x) and z — Vu^ {x). 



(14) 



Proof. By the assumption that (u, a;) > and {u, x') > 
0, we have x = -^/l — + z and x' = \/l — H^'Pu + z' . 

The following inequalities, which follow from HQ — A'/||2 < 
9X, will be frequently used. 

(1-6»)A < \\Qu\\ < {l + e)X , 
(/2-6»)A < \\Qz\\ < {l2 + 9)X. 
The following inequalities will also be useful in the proof. 

\\x~x'\\<{l/Vl^)\\z^z'\\, (15) 



where we used \/l — — \/\ — \P- < — i''^)\a — b\ for 

\a\ < V and |6| < v. Similarly, using the fact that Mu and 
Mz are orthogonal 



IIQxIl > {^\-\\z\\'^)\\Mu\\ - HQ - M||2 

Next, we want to show that 

< . \\z - z \\ . 



II 



\\Qx'\\ 



(^/^ 



(16) 
(17) 



We use the equality 1/a — 1/6 = (a^ — b'^)/{ab{a + b)) with 
a = ||Qx|| and b — \\Qx'\\. The denominator can be bounded 
using (161. It is enough to bound |||Qa;'|p — ||(5a;|p| using 



\\\Q{^l-\\zPu + z)f- \\Qi^l-\\z'Pu + z') 
< \\\zf - \\z'f\\\Quf + \\\Qzf - WQz'fl 



+ 2\{^1 - \\zPz - ^l-\\z'Pz'yQ*Qu\ . 

Note that ^\\zf - \\z'f^\\Quf < 2iy{l + ef\^\\z - z'\\, and 

lllgzf - \\Qz'f\ < 2u{l2 + ef\^\\z - z'\\. The last term 
can be decomposed into 

2| (Vl- PPz - ^/l-\\z'\\■^z'yQ*Qu\ 
< 2\^l- - ^1-P'||2| \z*Q*Qu\ 
+ 2y/l - \\z'\\^ \{z - z'YQ'Qul . 



Note that 1^1 - ll^P-^l - ||z'||2| < (!//Vl - iy^)\\z-z'\\, 
\z*Q*Qu\ < \^e{h + 0), and \{z - z'YQ'Qu] < X^h + 
— Collecting all the terms and assuming 9 < 1/40 
and u < 1/20, ||lga;|l - ||ga:'||| < (4.4;/ + O.W)X^\\z - z'\\. 
this implies ( 17 1. 



To prove (|T3|, define Ti = P^± {Qx -Qx')/\\Qx\\ and T2 = 
'Pn^{Q^')W/\\Q^\\) - (l/ll<3a;'||)). We bound each of these 
separately. 



llTill = 



\r^± {M{x-x') + {Q- M){x- 



hWz-z'w +e\\x~x' 



where (a) follows fro m ||16[ ) and the fact that V^±Mu — 0, 
and (6) follows from (|15[). Similarly, using (171 



\\T2\\^\\r^.{Q{Vi^W¥u + z'))\ 

2.2^ + 0.161 „ 



< (e + uh 



Notice that by assumption, we have 9 < (1/40), and by the 
definition of G in ( |11[ ), we have u < (1/20). Then, after 
some calculations, we have proved ( |13[ |. Analogously we 
can prove ( |14[ ) by bounding T3 = Vu(Qx — Qx') /\\Qx\\ and 
T4, = Vu{Q^{{l/\\Qx\\) - (l/||ga;'|l)) separately. □ 



We are now in position to prove Lemma [5. 2| 



Proof of Lemma |5.2[ We first show that for any x G G 
with a representative x such that {x, u) > 0, we have Qx € 
G. Note that, by triangular inequality, (gw)|| < 9\ 

and |lg«|l > (1 — 9)\- Applying Lemma 6.1 to x and u, we 
get 





Qx \ 

\Qx\\) 







{m\)h('<'^'''^)^'') ii^^.(--)ii 
/)y. (18) 



+ 3jyje+ [12(1 

For < (1/40) and for and 1/ satisfying 



l-h 



< V < min • 



15 



20} ' 

the right-hand side of (fTsl is always smaller than v, since 
{{l/{l-e)) + 2,v)e < (3/5)(l-Z2)^and 3^^ < (2/5)(l-Z2). 
This proves our claim for 9 < (1/40)(1 - hf^^ and f G 
[{26) /{I — I2), \/l — ^2/20] as per our assumptions. 

Next, we show that there is a contraction in t he s et G. For x 
and x' satisfying the assumptions in Lemma [6.1[ define y = 
Ax/\\Ax\\, y' = Ax'/\\Ax'\\, z = P^±{x), and V^±{x'). 
For ||a;|| < 1/ and ||a::'|| < ^ we have 

1 - 2^^ < {x,x') < 1. 

Using the above bounds we get 



< 



1 



{y,y') 



1 - {X,X')^ - (1 - ^2) 1 _ ^2-^3./^ • 

We can further bound (1 — {y, y'))/{i — {x, x')) using Lemma 



h-y'W < \l {h + + 3^)" + (4i/ + 4^)" \\z - z'W 
Using \\z ~ z'W^ < \\x - x'f = 2- 2{x,x'), we get 



-j^^ < {h + 3i/' + 36)' + (4^ + 46)' 

[x^ X ) 



For 9 < (1/40) (1 - h)^^^ and u < ^1^/20 as per our 
assumptions, it follows, after some algebra, that 



< 



1 - {x,x')^ 
for p > 1 - 0.8(1 - I2) 



{h + 3^2 + 3g)2 _^ (4^ 4, 4g)2 

1 - 1/2 



6.3 Proof of Lemma 15.31 

Expanding the summation, we get 

\\^-J2fi^s)-Kf)f 

s = l 

iX]ll/(X.)-M(/)ll'+ 



i2 



f2 



^^</(X.)-M(/),/(X.)-M/)> 



where (a, b) = a*b denotes the scalar product of two vectors. 
We can bound the first term by 209'^ /{t{l — h)'^), since 



||/(X3)-M(/)f < 



206*^ 



(19) 



where we used ||P„(/(X,) - fi{f))f < 4,9^/(1 - hf and 
\\V^^ (/(X,) - M(/))f < 16eV(l - 12? for X, G G. 

To bound the second term, let y = /(Xr) — ^J■{f)■ Note 



that by ([T9|, ||y|| < V209/{1 - h). We apply Theorem 
together with Lemma [5. 2| to get 



A.3 



]E[//(X,)|X.] <p=-'-||yf , 

for r < s. Using the fact that for |p| < 1, X^'—i X]r<s p"'' 
E*=i P"V^V(1 - P) < Pi/(1 - P), it follows that 

t 

^ ^ E [</(X0 - ^(/), /(X,) - /.(/)>] 

r — 1 r<s 
t 

2061 



2061^ 



EE/ 



(1-'2)(1-P) 



Combining the above bounds we get 

E||^E/(x=)-M/)ir^^+ 



t{l-hf {l-l2){l-p)f 



For p = 1 — (4/5)(l — h) as in Lemma 5.2 this proves the 
desired claim. □ 



6.4 Proof of Lemma 

we know p(G'^ 



From Theorem 



5.1 



0. This imples that 



Mf)f > l|K(M7))f > 1 - 1/400. Then, 

rf(u,p(/)) = "^ji^l^lJ^^" < 2||P„4m(/))II 



Let X be an random element in P„ following the stationary 
distribution fJ.{-), and the random vector X £ K" be the rep- 
resentative. From the definition of f(-), 'P^± (X) is invariant 
when we apply /(•), whence Vu^^ (/(X)) = Vuj^ {X). We can 
bound with the following recursion. 



= E 



+ 



\\UlrS(^)x\\ r 
p., ( n s'''^) (||ntiswx|| " 



||M'=X|| 



1 



E|P„,(n5<'^^yV,|nt,SWX|| IIAf^XII 



+ 



WMi^XW 



Let 1/ = 29/(1 — Z2). To bound the second term, note that 
niG") = 0. This imples that > \\Vu{X)\\ > 
and (X) < V with probability one. Then, 

||P„,(Af'=X)|| V 

S '2 - 



with probability one. To bound the first term, we use the 
telescoping sum 

fc k k 

Applying the triangular inequality of the operator norm, we 
have 

k 

||]JS<''-A/i < A'=((1 + 6I)'= - 1) , 
fci 

which follows from (llL^+i S^^') (A/*'' -M)M*-^ < A'=(1 + 

9)^~^9. Using the above inequality, we get the following 
bounds with probability one. 

\Pu^ ( n ^'''^) II ^ 11^"^ (^■^'^) II + II ( n ^ *^')^|| 

< \^{llv+{l + 9f - 1) , and 

l|p.u(n'5'''^)|| > ii^4Af^x)ii-||(n5''^-Af')x|| 



> A'(\/r^-(l + 6i)'= + l) . 



Then it follows that, 
1 



iiAf'=xiiiinLi 'S'(''^ii 



~ X^y/l - !/2(Vl - - (1 + e)*^ + 1) 

Collecting all the terms, we get 

2((i + e)'=-i + zti/)((i + 6i)*=-i) 21'iv 



rf(u,p(/)) < 



Vl - V^W^ - 1/2 - (1 + 6i)fe + 1) ^ 



Let fc = [log(6l)/log(/2)l such that I2 < 9. From the as- 
sumption that 9 < (1 - hf^'^/AO, it follows that 61 log e < 
0.12(1 - ^2). Then, 



(1 + gy-°eS/ iogi2) _ I <^ g{eioge/iogi2) _ -j^ 

1.1 



< 



(1-^2) 



9log{l/9) 



Then, after some algebra, (1 + e^)*^ - 1 < 61 + (1 + 6')((1 + 
^)(ioge/iog;2)_;^) < 1.561 log(l/6»)/(l-Z2), and (1 + 9)'' -1 + 
l^v < 1.561 log(l/6»)/(l - I2). It also follows that Vl - > 
x/399/400 and (Vl - i^'^ - (1 + 6')'= + 1) > 0.8. Collecting 
all the terms, we get the desired bound on d(u, /i(/)). 



7. EIGENVALUE ESTIMATION 

In the previous sections, we discussed the challenging task 
of computing the largest eigenvector under the gossip set- 
ting. A closely related task of computing the largest eigen- 
value is also practically important in many computational 
problems. For example, positioning from pairwise distances 
requires the leading eigenvalues, as well as the leading eigen- 
vectors, to correctly find the positions [24]. In the following, 
we present an algorithm to estimate the leading eigenvalue 
under the gossip setting and provide a performance guar- 
antee. Although the proposed algorithm uses the same tra- 
jectory {a;'*'} from GossiP PC A, the analysis is completely 
different from that of the eigenvector estimator. 

We assume to have at our disposal the random trajectory 
{2:''^}t>o, defined as in ([2|, possibly from running GOS- 
SIP PGA. Assume that we start with a;'''' with entries dis- 
tributed as N(0, 1). Our estimate for the top eigenvalue A 
after t iterations is 



1/t 



Although, this estimator uses the same trajectory {a:^*^}t>o 
as Gossip PGA, the analysis significantly differs from that 
of the eigenvector estimator. Hence, the statement of the er- 
ror bound in Theorem 1 7. 1| is also significantly different from 
Theorem [2?T| The main idea of our analysis is to bound 
the second moment of the estimate and apply Ghebyshev's 
inequality. Therefore, the second moment of 5*'*' character- 
ized by a determines the accuracy of the sparsification. 



max Var(S', 



ij J 



< ia/n)\\M\\l 



(20) 



The trade-off between d, which determines the complexity, 
and a depends on the specific sparsification method. With 
random sampling described in Section|3j it is not difficult to 
show that Eq. ([20| holds with only a = 0{l/d). 



Our main result bounds the error of the algorithm in terms 
of a, t, and 7 = X^f^id'^'l/'^)- proof of this theorem is 
outline in the following section. 



Theorem 7.1. Let{S^^^e>i be a sequence of i.i.d. nxn 
random matnces satisfying = M and Eq. (20 1. As- 

sume a < 1/2 and maxjlogj n, 2 logfi/;^) '"A <t < n/(4Q!7). 
Then with probability larger than 1 — max{5, 16/n^} 



A<') - A 
A 



< max 



8^2 ^ 

tn\/S' 



32 



1^ ' 



07^ (log n) 2 
tnS 



provided the right hand side is smaller than 1/t. 

7.1 Proof of Theorem O 

The proof idea is fairly simple. Let xo = a;^"^ and define 
a'*' = xSx^*' such that our estimator is A'*' = 
We will show that this is close to the desired result A by 
applying Chebyshev inequality to A'*-*. In order to do this 
we need to compute its mean and variance. 



Lemma 7.2. Consider the two operators A,B 



defined as follows 

A{X) = MXM* , 



(21) 

B{X) = (22) 
where {X,Y) — Tr{X*Y) . Then, conditional on xo we have 

E[\''^\xo]=x*oM'xo, 
Var{X^^^\xo) < {xoXq, {A + BY (xoXq)) - {xqxI, A^xox^)) . 

The next lemma provides a bound on the variance. 

Lemma 7.3. Let A,B : R"""" ^ M"""" be defined as m 
Eqs. (21) and , 22). Further assume a < 1/2 andatj < n/4. 
Then, for any two vectors x, y £ R", ||a;|| = \\y\\ = 1, 

I (yy* ,iA + BY{xx'))- {yy* , A' {xx*)) j 

2tfnQ* + 8a^7 / ^ \Xi\ 



< 4A 



4n2 

at J ( ^ |Ai 

j=i 



For the proof of Lemmas |7.2| and |7.3| we refer to the longer 
version of this paper. Let Qn be any measurable subset of 
R". This forms a set of 'good' initial condition xo, and its 
complement will be denoted by CJn- With an abuse of nota- 
tion, Qn will also denote the event xq £ Qn (and analogously 
for Qn). Also let, A = A^'*. Then, for any A > 0, 



P{A ^ [A(l - A)'/*, A(l + A)'/*]} 

<p{{|A*-A'| > AA*|e;„} + p{e„} 

< A4^E{(A*-A*)^|g4+P{g4 



sup Var(A'jxo)+P{e4. 



We shall upper bound each of the three terms in the above 
expression with 



Qn = \ x :max|ii*2:|< \/ 6 log n > . 



(23) 



Notice that P{«a;o)^ > 61ogn} < 2/n^. By the union 
bound we get P{a;o G Qn} < 2/n^. 



Next observe that 



1 - \ " X* 

^E{A*| Qn}-1= m<Xof\Qn} - 1) + E 



Ui Xo) 



The first step can be computed as 



E{{ulxoy\Qn} 



E{{utxof}-E{iutxoflgJ 



l-P{a;o e Qn} 

whence, recalling that E{(itia;o)^} — 1, and P{a::o G Qn} < 
1/2 for all n large enough, we get 

eQn} -E{(ulxofl^^^^g^y} 

i-H^oeQn} 

4 



\E{{u:Xof\Qn} - l\ = 



Note further that, by Chernoff inequahty, X^ILiC'^i^o)^ < 
3n with probability at least 1 — exp{(l/10)n}. Then, 



|^E{A'|g4-l|<J, 



3n 



1^ 
A 



Finally, using Lemma [7.2| and |7.3| we get 



^Var(A'|a;o) <4n'| 



2a^7 



+ 2—^ } —(UiXo) 

TJ ^ ^ A 



A J 



4n 



Further, for any xo G Gn, I]"=i(|A,|/A)«a;o)^ < 67I 
and therefore 
1 



A2> 



Var(A'|2:o) 



2a^7 ^ 12a7' 



^ , rSaS 12a7 
< 4n<^ H ^- 



3/2 



log n 36af7 (log n) 



n n 

3/2 



3/1 ,\2 . 



(logn) 36of7 (logn) 



+ 



<4{l5a7^/^(logn)+''"^^'f^°g"^'} 



where we used the fact that, for a < 1/2 and t > log2(n) 
we have Qf*/4 < 0^7/71. 



Collecting the various terms we obtain 
P{A ^ [A(l - A)'/*, A(l + A)'/*]} 

1 f 4 , „ ,t 1 ^ , 4a log n 



2 

^ — + 



. 4alognr 



3/2 ^ 36t7 (logn) 



whence P{A ^ [A(l - A)^''*,A(1 + A)^/*]} < 5, provided 
n > 4/VS, A > 4^/(nV5), A > 2n(|A2|/A)7^, A^ > 
(240Q7^/^logn)/5, A^ > 4 • 144at7^(logn)V(n5). The the- 
sis follows by noting that (1 + A)^/* < 1 + (A/t) and (1 - 
A)^/' > 1 - 2(A/t) provided A < 1/2. 
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APPENDIX 

A. PROOF OF THEOREM 5.1 

We start by restating the main result of ^23^ , in a somewhat 
more exphcit form. Recall that / : P„ — >■ R is said to be 
A-Holder continuous if its Holder coefficient, defined by 

l/(x)-/(y)| 



[/]. 



sup ■ 



d(x,y)^ 



(24) 



is finite. 



Theorem A. 1 (Le Page, 1982). Under assumptions LI 
and L2 there exists a unique measure on P„ that is sta- 
tionary for the Markov chain {Xt}. Further, there exists 
constants A > 0, p £ {0, 1), X G (0, 1] such that, for any 
X-Holder function / : P„ — >■ R, 

|E{/(Xt)}-M(/)l < 



Remark: The above follows immediately from Theorem 1 
in [23] via a simple coupling argument. Notice in particular 
that it applies to any Lipschitz function since [/]a is upper 
bounded by the Lipschitz modulus of /. 

Next we restate and prove the first part of Theorem |5.1[ 



Theorem A. 2. Assume conditions LI and L2 hold, to- 
gether with Al, A2. Denote by fj, the unique stationary mea- 
sure of the Markov chain {Xt}t>o. Then 



m(g^) = o. 



(25) 



Proof. Consider a Markov chain MCi with xq G G. The 
Markov chain MCi has a stationary distribution because 
conditions LI and L2 hold. From the property Al, we know 
that Xt € G. Therefore the stationary distribution of A/Ci, 
say fii, satisfies fii{G'^) = 0. From Theorem A.l we know 



that the stationary distribution is unique. Therefore fi = fii 
and hence /i(G'') = 0. □ 



It is further convenient to introduce, for f G N, A > the 
quantity 

r d(Xt,Yt iiA, 



Pa 



(a) 
< 



Proof. First notice that the function t 1— >■ px{t) is sub- 
multiplicative. This follows from 

P4t.+t,)= sup 

= ,n ^[r c'(Xti,YtJ -|^r d(Xt,+t2,Yti+t2) l^] 
.^^yPgG Ud(Xo,Yo)J L d(Xt,,Yt,) J J 

rrd(Xt^^]A. r d(Xt,,Yt,) ]^ 
m VI f ^''P m V 1 

= Px{tl)px{t2) . 

where (a) follows from the condition Al. From the condition 
A2', we know that p\{l) < p, hence p\{t) < p*. 

Next let {Xi}t>o and {Yt}t>o be Markov chains coupled as 
above, with initial conditions Xo = xo G G and Yo ~ p. We 
then have 

|E{/(XO}-K./)| = |E{/(Xt)}-E{/(Yt)}| 
<E{|/(XO-,f(YO|} 

< [/]AE{d(Xt,YO^} 

< [f]x Px{t) < [f]xp , 

where we used d(Xo, Yo) < 1 by definition. This concludes 
our proof. □ 



Finally, we will state and prove a generalization of the second 
part of Theorem |5.1[ For this we generalize hypothesis A2 
as follows. 

A2'. For any X 7^ y G G, E [d(5''*'x, ^'^'y)^] < prf(x,y)^. 



Theorem A. 3. Assume conditions LI and L2 hold, to- 
gether with Al and A2'. Denote by p the unique stationary 
measure of the Markov chain {Xt}f>o. Let xo G G. Then 
for any X-Holder function / : P„ — >■ R, we have 

|E{/(XO}-M,/)| < P [/]a. (26) 



The proof of this theorem is based on a coupling argument. 
The coupling assumed throughout is fairly simple: given ini- 
tial conditions xo, yo G G, we define the chain {(Xt, Yt)}t>o 
by letting (Xo, Yo) = (xo,yo) and, for all t > 1, 



Xt = • • • S'^^xo , Yt = • • • S^'^yo . (27) 



